Google Ngrams - English 5grams


Published September 27, 2011


This item contains the Google ngram data for the American English language set.



Here are the datasets backing the Google Books Ngram Viewer. These
datasets were generated in July 2009; we will update these datasets as
our book scanning continues, and the updated versions will have
distinct and persistent version identifiers (20090715 for the current
set).



Each of the numbered links below will directly download a fragment of the
given corpus. For instance, the first ten links below
collectively comprise the 1-gram (i.e., individual words) counts for
English, as collected from Google's scanned books around July 15,
2009. In addition, for each corpus we provide a total counts file,
which records the total number of 1-grams contained in the books that make up the corpus.
This file is useful for computing the relative frequencies of n-grams.
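
As a rough sketch of that computation, assuming the per-year totals have already been loaded into a dictionary keyed by year: the function name, the dictionary, and the total used below are hypothetical and chosen only for illustration, not taken from the datasets.

```python
def relative_frequency(match_count: int, total_count: int) -> float:
    """Return match_count as a fraction of all 1-grams counted for that year."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return match_count / total_count


# Hypothetical total for one year; real values come from the total counts file.
totals_by_year = {1978: 1_000_000}

freq = relative_frequency(313, totals_by_year[1978])
print(f"{freq:.6f}")
```

Dividing by the total for the same year is what makes counts from different years comparable, since the number of scanned books varies from year to year.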



Details on the corpus construction can be found in the Science article
written by Jean-Baptiste Michel et al.; they are only summarized here.
Of note, we report only the n-grams that appeared over 40 times in the
whole corpus. Therefore, the sum of the 1-gram occurrences in any given
corpus is smaller than the number given in the total counts file.




File format: Each of the numbered files below is
zipped tab-separated data. (Yes, we know the files have .csv
extensions.) Each line has the following format:



ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE



As an example, here are the 30,000,000th and 30,000,001st lines from file 0 of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):

circumvallate   1978   313   215   85
circumvallate   1979   183   147   77


The first line tells us that in 1978, the word "circumvallate"
(which means "surround with a rampart or other fortification", in case
you were wondering) occurred 313 times overall, on 215 distinct pages
and in 85 distinct books from our sample.
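
The line format described above could be parsed along these lines. This is a minimal sketch: the `NgramRecord` type and the `parse_ngram_line` function are names invented here for illustration, and the code assumes the fields are separated by literal tab characters, as the format description states.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    match_count: int
    page_count: int
    volume_count: int


def parse_ngram_line(line: str) -> NgramRecord:
    """Split one tab-separated line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    ngram, year, match_count, page_count, volume_count = fields
    return NgramRecord(
        ngram, int(year), int(match_count), int(page_count), int(volume_count)
    )


# The first example line quoted above, with explicit tabs.
record = parse_ngram_line("circumvallate\t1978\t313\t215\t85\n")
print(record.ngram, record.year, record.match_count)
```

Splitting on tabs only, rather than on any whitespace, keeps multi-word n-grams intact, because the words inside an n-gram are separated by spaces.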



The format of the total counts file is identical, except that the ngram field is absent: there is only one triplet of values (match_count, page_count, volume_count) per year.
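
A total counts file in that shape could be loaded as sketched below. The field order (year, then match_count, page_count, volume_count, separated by tabs) is inferred from the description above, and the function name and sample values are hypothetical.

```python
def parse_total_counts(lines):
    """Map each year to its (match_count, page_count, volume_count) triplet."""
    totals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        year, match_count, page_count, volume_count = (
            int(field) for field in line.split("\t")
        )
        totals[year] = (match_count, page_count, volume_count)
    return totals


# Made-up values, for illustration only.
sample = ["1978\t1000000\t40000\t900\n", "1979\t1200000\t45000\t950\n"]
totals = parse_total_counts(sample)
print(totals[1978][0])
```

Any iterable of lines works as input, including an open text file.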



Here's the 9,000,000th line from file 0 of the English 5-grams (googlebooks-eng-all-5gram-20090715-0.csv.zip):

analysis is often described as  1991  1   1   1


In 1991, the phrase "analysis is often described as" occurred one time
(that's the first 1), and on one page (the second 1), and in one book
(the third 1).



Inside each file the ngrams are sorted alphabetically and then
chronologically. Note that the files themselves aren't ordered
with respect to one another. For example, a French two-word phrase
starting with 'm' will be in the middle of one of the French 2gram
files, but there's no way to know which without checking them all.
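
A lookup therefore has to scan the files one by one. The sketch below shows one way to do that with Python's standard `zipfile` module. It rests on several assumptions that the description does not confirm: that the data is UTF-8 encoded, that all rows for one ngram sit in a single file, and that `find_ngram` (a name invented here) is given the paths of the downloaded archives. It relies only on the rows for a given ngram being contiguous within a file, not on any particular alphabetical collation.

```python
import io
import zipfile


def find_ngram(zip_paths, target):
    """Return [(year, match_count, page_count, volume_count), ...] for target."""
    rows = []
    for path in zip_paths:
        with zipfile.ZipFile(path) as archive:
            for member in archive.namelist():
                with archive.open(member) as raw:
                    # Assumption: the files are UTF-8 encoded text.
                    text = io.TextIOWrapper(raw, encoding="utf-8")
                    for line in text:
                        ngram, *counts = line.rstrip("\r\n").split("\t")
                        if ngram == target:
                            rows.append(tuple(int(c) for c in counts))
                        elif rows:
                            # Rows for one ngram are contiguous, so we are past it.
                            break
                if rows:
                    # Assumption: an ngram's rows do not span several files.
                    return rows
    return rows
```

A call such as `find_ngram(paths, "analysis is often described as")`, where `paths` lists the downloaded 5-gram archives, would stream through each archive without unpacking it to disk and stop reading as soon as the matching block ends.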



If datasets aren't yet complete, that means we're still busy uploading them.
They'll be available soon.



Usage: This compilation is licensed under a Creative Commons Attribution 3.0 Unported License.




Identifier google_ngrams-eng-all-5grams
Creator Google, Inc.
Mediatype web
Date 2011-09-27 05:49:03
Year 2011
Licenseurl http://creativecommons.org/licenses/by/3.0/
Publicdate 2011-09-27 05:49:03

Reviews

Reviewer: zbynekT - - December 26, 2013
Subject: download problem
How can I download some files? I always get 'Item not available
The item is not available due to issues with the item's content. '

Thank you.

Collection google ngrams, Data Collection, Internet Archive Web Crawls, Web Crawls
Uploaded by underscor on 9/27/2011